For STA 9750 Mini Project 2, I’m going to propose remaking a classic movie: Independence Day.
Watch out, Earth!
Pitch
Originally released in 1996 and starring Will Smith, this film depicts Earth’s fight for freedom from an alien invasion.
Starring Jacob Batalon in his breakout role as Captain Steven Hiller and famous robo-freedom-fighter John DiMaggio as David Levinson, this film follows an ensemble of characters as they look to save Earth from the new threat from the aliens of Omicron Persei 10.
Jacob Batalon
John DiMaggio
Directed by Glass Onion and Knives Out director Rian Johnson, the action behind this film belies the mystery of the Omicronians and their true plot.
Rian Johnson
How we got here
Now that I’ve thoroughly grabbed your attention, how the heck did we come up with this cast of characters?
After sorting through IMDb data, these three just truly spoke to us and we knew we were on to something.
Task 0: In which the data is gathered
Since we’re doing this in R, let’s get some code going
Rows: 13884604 Columns: 6
── Column specification ────────────────────────────────────────────────────────
Delimiter: "\t"
chr (6): nconst, primaryName, birthYear, deathYear, primaryProfession, known...
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Warning: One or more parsing issues, call `problems()` on your data frame for details,
e.g.:
dat <- vroom(...)
problems(dat)
Rows: 11172251 Columns: 9
── Column specification ────────────────────────────────────────────────────────
Delimiter: "\t"
chr (8): tconst, titleType, primaryTitle, originalTitle, startYear, endYear,...
dbl (1): isAdult
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Rows: 8579083 Columns: 4
── Column specification ────────────────────────────────────────────────────────
Delimiter: "\t"
chr (4): tconst, parentTconst, seasonNumber, episodeNumber
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Rows: 1488767 Columns: 3
── Column specification ────────────────────────────────────────────────────────
Delimiter: "\t"
chr (1): tconst
dbl (2): averageRating, numVotes
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Rows: 10514405 Columns: 3
── Column specification ────────────────────────────────────────────────────────
Delimiter: "\t"
chr (3): tconst, directors, writers
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Warning: One or more parsing issues, call `problems()` on your data frame for details,
e.g.:
dat <- vroom(...)
problems(dat)
Rows: 81376330 Columns: 6
── Column specification ────────────────────────────────────────────────────────
Delimiter: "\t"
chr (5): tconst, nconst, category, job, characters
dbl (1): ordering
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Looking at the data we just pulled
Code
TITLE_RATINGS |>ggplot(aes(x=numVotes)) +geom_histogram(bins=30) +xlab("Number of IMDB Ratings") +ylab("Number of Titles") +ggtitle("Majority of IMDB Titles Have Less than 100 Ratings") +theme_bw() +scale_x_log10(label=scales::comma) +scale_y_continuous(label=scales::comma)
We have several tasks to complete with this data, including some EDA!
Task 1: Clean Clean Clean
Code
convert_columns <-function(data, columns, conversion ="numeric") {# Check if conversion is validif(!conversion %in%c("numeric", "logical")) {stop("Invalid conversion type. Choose either 'numeric' or 'logical'.") }# Apply the conversion based on argumentif (conversion =="numeric") { data <- data |>mutate(across(all_of(columns), as.numeric)) } elseif (conversion =="logical") { data <- data |>mutate(across(all_of(columns), as.logical)) }return(data)}TITLE_BASICS <-convert_columns(TITLE_BASICS, columns =c("startYear","endYear","runtimeMinutes"), conversion="numeric")TITLE_EPISODES <-convert_columns(TITLE_EPISODES, columns=c("seasonNumber","episodeNumber"), conversion="numeric")
Code
# Task 2: Provided Questions, Provided Answers#Here we seek to answer several questions from the dataset.#2.1: How many movies are in our data set? How many TV series? How many TV episodes?# TITLE_BASICS will have this answerglimpse(TITLE_BASICS)
movie short tvEpisode tvMiniSeries tvMovie tvSeries
132246 16735 156768 5932 15048 29992
tvShort tvSpecial video videoGame
411 3064 9345 4690
Code
# movie, tvEpisode and tvSeriesMOVIE_COUNT <- TITLE_BASICS |>filter(`titleType`=="movie") |>select(`tconst`) |>unique() |>count()sprintf("There are %s movies in the data",format(MOVIE_COUNT,big.mark=",", scientific=FALSE))
[1] "There are # A tibble: 1 × 1 movies in the data"
[2] "There are n movies in the data"
[3] "There are <int> movies in the data"
[4] "There are 1 132246 movies in the data"
Code
TV_SERIES_COUNT <- TITLE_BASICS |>filter(`titleType`=="tvSeries") |>select(`tconst`) |>unique() |>count()sprintf("There are %s TV Series in the data",format(TV_SERIES_COUNT,big.mark=",", scientific=FALSE))
[1] "There are # A tibble: 1 × 1 TV Series in the data"
[2] "There are n TV Series in the data"
[3] "There are <int> TV Series in the data"
[4] "There are 1 29992 TV Series in the data"
Code
TV_EPISODE_COUNT <- TITLE_BASICS |>filter(`titleType`=="tvEpisode") |>select(`tconst`) |>unique() |>count()sprintf("There are %s TV Episodes in the data",format(TV_EPISODE_COUNT,big.mark=",", scientific=FALSE))
[1] "There are # A tibble: 1 × 1 TV Episodes in the data"
[2] "There are n TV Episodes in the data"
[3] "There are <int> TV Episodes in the data"
[4] "There are 1 156768 TV Episodes in the data"
Code
#2.2: Oldest living person in dataglimpse(NAME_BASICS)
# A tibble: 10 × 3
primaryName birthYear deathYear
<chr> <dbl> <dbl>
1 Traudl Lessing 1625 NA
2 Robert De Visée 1655 NA
3 William Sandys 1767 NA
4 Richard Dybeck 1811 NA
5 Albert Monnier 1815 NA
6 C. Hostrup 1818 NA
7 Edouard Martin 1825 NA
8 Gustav Lange 1830 NA
9 William S. Hooser 1836 NA
10 Ion Ivanovici 1845 NA
Code
# Clearly this isn't the way to go about thisglimpse(TITLE_BASICS)
TITLES_AND_NAMES <- NAME_BASICS |>separate_longer_delim(knownForTitles, delim =",") |>inner_join(TITLE_BASICS, by =c("knownForTitles"="tconst")) |>filter(is.na(deathYear),birthYear >=1908) |>#according to google, the current oldest person alive was born in 1908arrange(birthYear) |>select(primaryName, birthYear, deathYear) |>unique() |>slice_head(n =5)TITLES_AND_NAMES # Because there are too many NULL deathYear values, this question can't really be answered
# A tibble: 5 × 3
primaryName birthYear deathYear
<chr> <dbl> <dbl>
1 Robert Aisner 1908 NA
2 Ugo Amadoro 1908 NA
3 Hanna Bark 1908 NA
4 Harry Belock 1908 NA
5 Aida Broadbent 1908 NA
Code
#2.3: There is one TV Episode in this data set with a perfect 10/10 rating and at least 200,00 IMDb ratings. What is it? Which series does it belong to?glimpse(TITLE_RATINGS)
HIGHEST_RATED <- TITLE_RATINGS |>filter(numVotes >=200000) |>slice_max(order_by = averageRating) |>left_join(TITLE_EPISODES, by =c("tconst"="tconst")) |>inner_join(TITLE_BASICS, by =c("tconst"="tconst")) |>left_join(TITLE_BASICS, by =c("parentTconst"="tconst")) |>select(filmID =`tconst`,seriesName =`primaryTitle.y`,episodeName =`primaryTitle.x`,`seasonNumber`,`episodeNumber`,yearAired =`startYear.x`,`averageRating`,`numVotes` )HIGHEST_RATED # Unsurprisingly, it's a season 5 episode of Breaking Bad, specifically, the one where Hank gets got
If anyone’s interested in a famous scene (and subsequent meme) from this episode (Warning: violence and spoilers)
Code
#2.4: Which 4 projects is actor Mark Hamill most known for?# Just guessing beforehand: Star War IV, V, VI, and VA for the Joker in the Batman animated seriesglimpse(NAME_BASICS)
MARK_HAMILL <- NAME_BASICS |>filter(primaryName =="Mark Hamill")MARK_HAMILL # going to guess the record with multiple titles is the correct Mark Hamill
# A tibble: 1 × 6
nconst primaryName birthYear deathYear primaryProfession knownForTitles
<chr> <chr> <dbl> <dbl> <chr> <chr>
1 nm0000434 Mark Hamill NA NA actor,producer,direc… tt0076759,tt2…
# A tibble: 4 × 3
actorID primaryName primaryTitle
<chr> <chr> <chr>
1 nm0000434 Mark Hamill Star Wars: Episode IV - A New Hope
2 nm0000434 Mark Hamill Star Wars: Episode VIII - The Last Jedi
3 nm0000434 Mark Hamill Star Wars: Episode V - The Empire Strikes Back
4 nm0000434 Mark Hamill Star Wars: Episode VI - Return of the Jedi
Apparently, Star Wars Episode VIII gets a listing before his Joker VA. Hard disagree, but I digress.
Code
#2.5: What TV series, with more than 12 episodes, has the highest average rating?glimpse(TITLE_EPISODES)
# Yep, this returned all the BB episodesTWELVE_EP_SERIES <- TITLE_EPISODES |>group_by(parentTconst) |>summarize(episodeCount =n()) |>filter(episodeCount >=12) |>inner_join(TITLE_RATINGS,c("parentTconst"="tconst")) |>left_join(TITLE_BASICS,c("parentTconst"="tconst")) |>select(seriesID =`parentTconst`,`primaryTitle`,`startYear`,`endYear`,`episodeCount`,`averageRating`,`numVotes` ) |>arrange(desc(averageRating), desc(episodeCount), desc(numVotes)) |>slice_head(n=5)TWELVE_EP_SERIES # I have no clue what any of these shows are
# A tibble: 5 × 7
seriesID primaryTitle startYear endYear episodeCount averageRating numVotes
<chr> <chr> <dbl> <dbl> <int> <dbl> <dbl>
1 tt15613780 Craft Games 2014 NA 318 9.7 150
2 tt8560108 Prime Time 2016 NA 212 9.7 181
3 tt21989170 Gumbino 2016 NA 148 9.7 343
4 tt7151672 Choufli Hal 2005 2009 134 9.7 2930
5 tt23028046 Jogandofodda… 2019 2021 101 9.7 168
Code
#2.6: The TV Series Happy Days (1974-1984) gives us the common idiom "jump the shark". The phrase comes from# a controversial fifth season episode (aired in 1977) in which a lead character literally jumped over a shark on# water skis. Idiomatically, it is used to refer to the moment when a once-great show becomes ridiculous and# rapidly loses quality. Is it true that episodes from later seasons of Happy Days have lower# average ratings than the early seasons?# First find the ID for Happy Daysglimpse(TITLE_BASICS)
min_point <- HAPPY_DAYS[which.min(HAPPY_DAYS$avgRating), ]max_point <- HAPPY_DAYS[which.max(HAPPY_DAYS$avgRating), ]sharkjump <- HAPPY_DAYS[HAPPY_DAYS$seasonNumber ==5, ]highlighted_points <-factor(c("Lowest Rating", "Highest Rating", "Jumped Shark"),levels =c("Lowest Rating", "Highest Rating", "Jumped Shark")) # Set the order here# Plot ratings over timeggplot(HAPPY_DAYS, aes(x = seasonNumber, y = avgRating)) +geom_line(size=1) +geom_point(color ="#D35400",size=2) +# Add custom points for min, max, and season 5geom_point(data = min_point, aes(color = highlighted_points[1]), size =3) +geom_point(data = max_point, aes(color = highlighted_points[2]), size =3) +geom_point(data = sharkjump, aes(color = highlighted_points[3]), size =3) +# Add labels and formattinglabs(title ="Happy Days Average Rating by Season", x ="Season", y ="Average Rating") +# Customize ticks for x and y axesscale_x_continuous(breaks =seq(min(HAPPY_DAYS$seasonNumber), max(HAPPY_DAYS$seasonNumber), by =1)) +scale_y_continuous(labels =label_number(), breaks =seq(floor(min(HAPPY_DAYS$avgRating)), ceiling(max(HAPPY_DAYS$avgRating)), by =0.5)) +# Manually define colors for the highlighted pointsscale_color_manual(values =c("Lowest Rating"="blue", "Highest Rating"="red", "Jumped Shark"="green"),name ="" ) +theme_minimal() +theme(plot.title =element_text(hjust =0.5, size =16, face ="bold"), axis.title.x =element_text(size =14), axis.title.y =element_text(size =14), axis.text =element_text(size =12), # panel.grid.minor = element_blank(), legend.position ="right", # Position the legend on the rightlegend.text =element_text(face ="bold", size =8),# Set the background colorspanel.background =element_rect(fill ="gray90"), # Gray background for the plot areaplot.background =element_rect(fill ="gray90"), # Gray background for the entire plot# Enhance axis linesaxis.line =element_line(color ="black", size =1.2), # Change color and thickness of axis linesaxis.ticks =element_line(color ="black", size =1), # Change color and thickness of tick marksaxis.ticks.length =unit(0.25, "cm"), # Adjust length of tick marks# Enhance gridlinespanel.grid.major =element_line(color ="gray60", size =0.5), # Change color and thickness of major gridlinespanel.grid.minor =element_line(color ="gray60", size =0.5) # Change color and thickness of minor gridlines )
As we can see, the shark-jump likely caused an immediate dip in viewership, but it rebounded for the final 3 seasons and went out near the height of its popularity
Task 3: What is success?
Here we create a “success” metric to determine whether or not a film meets our own standards.
First let’s define success. We want to look at two features: * Average Rating * Num Votes Other features of note could be * Crew Size * Run Time
Code
# First let's just get only movie data from 1960 onwardsMOVIES <- TITLE_BASICS |>filter(titleType =="movie",startYear >=1960)# Add in ratingsMOVIES <- MOVIES |>left_join(TITLE_RATINGS, c("tconst"="tconst")) |>select(`tconst`,`primaryTitle`,`isAdult`,releaseYear =`startYear`,`runtimeMinutes`,`genres`,`averageRating`,`numVotes` )# Add in crew sizeCREWS <- TITLE_PRINCIPALS |>group_by(tconst) |>summarize(castCount =n())MOVIES <- MOVIES |>left_join(CREWS, c("tconst"="tconst"))# Count the number of genres in the 'genres' columnGENRE_COUNT <- MOVIES |>separate_longer_delim(genres, delim =",") |># Split into multiple rows by delimitergroup_by(tconst) |># Group by title to keep track of original rowssummarise(genre_count =n(), .groups ='drop') # Count the number of genres# Join back to the original MOVIES dataframe if neededMOVIES <- MOVIES |>left_join(GENRE_COUNT, by ="tconst")# summary statsdescribe(MOVIES)
# 90% of movies fall between an average rating of 3.4 and 7.7quantile(MOVIES$averageRating,.05)
5%
3.4
Code
quantile(MOVIES$averageRating,.95)
95%
7.7
Code
# We can use this to determine both what makes a great move and what makes a flopFLOPS <- MOVIES |>filter(averageRating <=3.4)SUCCESSES <- MOVIES |>filter(averageRating >=7.7)describe(FLOPS)
# Successes seem to be longer on average but also confined to a range of 1,440 minutes runtime while flops get a little silly# A lot more people vote for successes (surprising because you'd think people are more likely to respond if they dislike something than if they like it)# https://www.nationalstrategic.com/why-would-they-write-that-the-psychology-of-customer-reviews/# The number of genres used seem to not matter pretty much at allMode <-function(x) { ux <-unique(x) ux[which.max(tabulate(match(x, ux)))]}Mode(SUCCESSES$genres)
Documentaries and Dramas are successfull while Horror is not. Comedy can go either way.
Metrics for success: * Good genre (Drama, Documentary) * High number of votes (> 500) * Runtime between 70 minutes and 150 minutes. Dock points for each STD outside of that Metrics for flop: * Bad genre (Horror, Comedy) * Low number of votes (<= 500) * Runtime greater than 200 minutes
Code
# Define the metric for success functionmetric_for_success <-function(df) { df <- df |>mutate(# Calculate positive contributionspositive_metric =ifelse(str_detect(genres, "Drama|Documentary"), 0.2, 0) +ifelse(numVotes >=500, 0.4, 0) +ifelse(runtimeMinutes >=70& runtimeMinutes <=150, 0.4, 0),# Calculate negative contributionsnegative_metric =ifelse(str_detect(genres, "Horror|Comedy"), -0.4, 0) +ifelse(numVotes <500|is.na(numVotes), -0.3, 0) +ifelse(runtimeMinutes >200, -pmin(0.5, (runtimeMinutes -200) /90), 0),# Combine the positive and negative contributionsmetric = positive_metric + negative_metric,# Bound the metric between -1 and 1metric =pmin(pmax(metric, -1), 1),# Round the metric to 3 decimal placesmetric =round(metric, 3) ) |># Drop the intermediate columns (optional)select(-positive_metric, -negative_metric)return(df)}#Apply the metric function to the DataFrameMOVIES <-metric_for_success(MOVIES)table(MOVIES$metric)
#3.2: Choose 3-5 movies with large numbers of IMDb votes that socre poorly on your success metric and confirm# that they are indeed low qualityMOVIES |>arrange(metric, desc(numVotes)) |>select(`primaryTitle`,`releaseYear`,`numVotes`,`averageRating`,`genres`,`metric`) |>slice_head(n=5)
I didn’t need a personal metric to tell me that these 5 movies suck
Code
#3.3 Choose a prestige actor or director and confirm that they have many projects with high scores on your success metric.NAME_BASICS |>filter(primaryName =="Stanley Kubrick") |>separate_longer_delim(knownForTitles, delim =",") |>inner_join(MOVIES, by =c("knownForTitles"="tconst")) |>select(`primaryName`,`primaryTitle`,`averageRating`,`numVotes`,`genres`,`runtimeMinutes`,`metric` )
# A tibble: 4 × 7
primaryName primaryTitle averageRating numVotes genres runtimeMinutes metric
<chr> <chr> <dbl> <dbl> <chr> <dbl> <dbl>
1 Stanley Kubr… 2001: A Spa… 8.3 733226 Adven… 149 0.8
2 Stanley Kubr… A Clockwork… 8.2 894278 Crime… 136 0.8
3 Stanley Kubr… Barry Lyndon 8.1 187061 Adven… 185 0.6
4 Stanley Kubr… Full Metal … 8.2 804176 Drama… 116 1
Although the metric is imperfect, it is pretty good at telling us what’s good and what’s not (based on average ratings)
#3.5: Come up with a numerical threshold for a project to be a ‘success’; that is, determine a value such that movies above `v` are all “solid” or better.v <-0.6
Task 4: What makes a good movie?
Code
# Juuuust incaseavgrtgthrsh <-6.0# Create a new column indicating success based on averageRatingMOVIES <- MOVIES |>mutate(success_averageRating =ifelse(averageRating >= avgrtgthrsh, 1, 0),success_metric =ifelse(metric >= v, 1, 0)) |>mutate(decade =floor(releaseYear /10) *10)MOVIES_GENRES <- MOVIES |>filter(genres !="\\N", averageRating !="\\N") |>separate_longer_delim(genres, delim =",") |>mutate(genres =trimws(genres)) |>mutate(decade =floor(releaseYear /10) *10)# Calculate total films by genre and decadetotal_films_by_genre <- MOVIES_GENRES |>group_by(decade, genres) |>summarize(total_count =n(), .groups ="drop")# Filter genres by countpopular_genres <- total_films_by_genre |>filter(total_count >5000) |>select(genres) |>distinct()# Count successes for averageRating (for popular genres)success_by_genre_rating_popular <- MOVIES_GENRES |>filter(genres %in% popular_genres$genres) |>group_by(decade, genres) |>summarize(success_count =sum(success_averageRating, na.rm =TRUE), .groups ="drop") |>left_join(total_films_by_genre, by =c("decade", "genres")) |>mutate(success_rate = success_count / total_count *100) |>arrange(decade, desc(success_rate))# Count successes for metric (for popular genres)success_by_genre_metric_popular <- MOVIES_GENRES |>filter(genres %in% popular_genres$genres) |>group_by(decade, genres) |>summarize(success_count =sum(success_metric, na.rm =TRUE), .groups ="drop") |>left_join(total_films_by_genre, by =c("decade", "genres")) |>mutate(success_rate = success_count / total_count *100) |>arrange(decade, desc(success_rate))# Count successes for averageRating (for less popular genres)success_by_genre_rating_less <- MOVIES_GENRES |>filter(!(genres %in% popular_genres$genres)) |>group_by(decade, genres) |>summarize(success_count =sum(success_averageRating, na.rm =TRUE), .groups ="drop") |>left_join(total_films_by_genre, by =c("decade", "genres")) |>mutate(success_rate = success_count / total_count *100) |>arrange(decade, desc(success_rate))# Count successes for metric (for less popular genres)success_by_genre_metric_less <- MOVIES_GENRES |>filter(!(genres %in% popular_genres$genres)) |>group_by(decade, genres) |>summarize(success_count =sum(success_metric, na.rm =TRUE), .groups ="drop") |>left_join(total_films_by_genre, by =c("decade", "genres")) |>mutate(success_rate = success_count / total_count *100) |>arrange(decade, desc(success_rate))# Plotting Success Rates for Average Rating (Popular Genres)p1 <-ggplot(success_by_genre_rating_popular, aes(x = decade, y = success_rate, color = genres, group = genres)) +geom_line(size =1) +# Add linegeom_point(size =3) +# Add points for claritylabs(title ="Success Rates by Genre (Average Rating - Popular Genres)", x ="Decade", y ="Success Rate (%)") +scale_y_continuous(limits =c(0, 100)) +# Set y-axis limits from 0 to 100theme_minimal() +theme(axis.text.x =element_text(angle =45, hjust =1), # Rotate x-axis labelsplot.title =element_text(hjust =0.5) # Center the title )# Convert ggplot to plotly for hover functionalityp1_interactive <-ggplotly(p1)# Plotting Success Rates for Metric (Popular Genres)p2 <-ggplot(success_by_genre_metric_popular, aes(x = decade, y = success_rate, color = genres, group = genres)) +geom_line(size =1) +# Add linegeom_point(size =3) +# Add points for claritylabs(title ="Success Rates by Genre (Metric - Popular Genres)", x ="Decade", y ="Success Rate (%)") +scale_y_continuous(limits =c(0, 100)) +# Set y-axis limits from 0 to 100theme_minimal() +theme(axis.text.x =element_text(angle =45, hjust =1), # Rotate x-axis labelsplot.title =element_text(hjust =0.5) # Center the title )# Convert ggplot to plotly for hover functionalityp2_interactive <-ggplotly(p2)# Plotting Success Rates for Average Rating (Less Popular Genres)p3 <-ggplot(success_by_genre_rating_less, aes(x = decade, y = success_rate, color = genres, group = genres)) +geom_line(size =1) +# Add linegeom_point(size =3) +# Add points for claritylabs(title ="Success Rates by Genre (Average Rating - Less Popular Genres)", x ="Decade", y ="Success Rate (%)") +scale_y_continuous(limits =c(0, 100)) +# Set y-axis limits from 0 to 100theme_minimal() +theme(axis.text.x =element_text(angle =45, hjust =1), # Rotate x-axis labelsplot.title =element_text(hjust =0.5) # Center the title )# Convert ggplot to plotly for hover functionalityp3_interactive <-ggplotly(p3)# Plotting Success Rates for Metric (Less Popular Genres)p4 <-ggplot(success_by_genre_metric_less, aes(x = decade, y = success_rate, color = genres, group = genres)) +geom_line(size =1) +# Add linegeom_point(size =3) +# Add points for claritylabs(title ="Success Rates by Genre (Metric - Less Popular Genres)", x ="Decade", y ="Success Rate (%)") +scale_y_continuous(limits =c(0, 100)) +# Set y-axis limits from 0 to 100theme_minimal() +theme(axis.text.x =element_text(angle =45, hjust =1), # Rotate x-axis labelsplot.title =element_text(hjust =0.5) # Center the title )# Convert ggplot to plotly for hover functionalityp4_interactive <-ggplotly(p4)
Code
p1_interactive
Code
p2_interactive
Code
p3_interactive
Code
p4_interactive
They’re interactive!
Code
#4.1: Which genre has the most "successs" in each decade"?# Count successes for averageRatingsuccess_by_genre_rating <- MOVIES_GENRES |>group_by(decade, genres) |>summarize(success_count =sum(success_averageRating, na.rm =TRUE), .groups ="drop") |>arrange(decade, desc(success_count))# Count successes for metricsuccess_by_genre_metric <- MOVIES_GENRES |>group_by(decade, genres) |>summarize(success_count =sum(success_metric, na.rm =TRUE), .groups ="drop") |>arrange(decade, desc(success_count))# Identify the genre with the most successes for each decade (Average Rating)most_successful_genre_rating <- success_by_genre_rating |>group_by(decade) |>slice_max(success_count, n =2) |>ungroup()# Identify the genre with the most successes for each decade (Metric)most_successful_genre_metric <- success_by_genre_metric |>group_by(decade) |>slice_max(success_count, n =2) |>ungroup()most_successful_genre_rating
# A tibble: 14 × 3
decade genres success_count
<dbl> <chr> <dbl>
1 1960 Drama 2567
2 1960 Comedy 1287
3 1970 Drama 2990
4 1970 Comedy 1318
5 1980 Drama 3220
6 1980 Comedy 1783
7 1990 Drama 4170
8 1990 Comedy 2184
9 2000 Drama 7339
10 2000 Comedy 3325
11 2010 Drama 11449
12 2010 Documentary 5084
13 2020 Drama 5103
14 2020 Comedy 2174
Code
most_successful_genre_metric
# A tibble: 15 × 3
decade genres success_count
<dbl> <chr> <dbl>
1 1960 Drama 1526
2 1960 Adventure 308
3 1960 Romance 308
4 1970 Drama 1890
5 1970 Crime 517
6 1980 Drama 2083
7 1980 Action 600
8 1990 Drama 3258
9 1990 Comedy 888
10 2000 Drama 6342
11 2000 Comedy 1711
12 2010 Drama 10358
13 2010 Action 2553
14 2020 Drama 4501
15 2020 Thriller 1319
Drama and Action films are pretty succesfull!
Code
#4.2 Which genre consistently has the most "successes"? Which genre used to reliably produce "successes" and has fallen out of favor?p1_interactive
Code
p2_interactive
Documentaries have the most consistent successes.
Horror used to produce a lot of successes but has been on a downward trend each decade.
Code
#4.3: What genre has produced the most “successes” since 2010?#Does it have the highest success rate or does it only have a large number of successes because there are many productions in that genre?p1_interactive
Code
p2_interactive
Code
most_successful_genre_rating
# A tibble: 14 × 3
decade genres success_count
<dbl> <chr> <dbl>
1 1960 Drama 2567
2 1960 Comedy 1287
3 1970 Drama 2990
4 1970 Comedy 1318
5 1980 Drama 3220
6 1980 Comedy 1783
7 1990 Drama 4170
8 1990 Comedy 2184
9 2000 Drama 7339
10 2000 Comedy 3325
11 2010 Drama 11449
12 2010 Documentary 5084
13 2020 Drama 5103
14 2020 Comedy 2174
Code
most_successful_genre_metric
# A tibble: 15 × 3
decade genres success_count
<dbl> <chr> <dbl>
1 1960 Drama 1526
2 1960 Adventure 308
3 1960 Romance 308
4 1970 Drama 1890
5 1970 Crime 517
6 1980 Drama 2083
7 1980 Action 600
8 1990 Drama 3258
9 1990 Comedy 888
10 2000 Drama 6342
11 2000 Comedy 1711
12 2010 Drama 10358
13 2010 Action 2553
14 2020 Drama 4501
15 2020 Thriller 1319
Drama seems to be the answer to both of these
Code
#4.4: What genre has become more popular in recent years?p3_interactive
Code
p4_interactive
Action movies
Task 5: Identifying the Crew
Hang on to your hats because there’s about to be a lot of data manipulation.
Basically, we’re going to whittle down the list of actors and directors until we find a group that matches what we’re looking for in producing the next box-office smash hit.
Code
# #Filter TITLE_BASICS based on metric threshold and genreFILTERED_MOVIES <- MOVIES |>separate_longer_delim(genres, delim =",") |>filter(success_metric ==1, genres %in%c("Action", "Drama", "Thriller", "Documentary")) |>group_by(tconst) |>summarize(title =first(primaryTitle), # Adjust according to your datasetyear =first(releaseYear), # Adjust according to your datasetmetric =first(metric), # Adjust according to your dataset,avgRating =first(averageRating), numVotes,decade =first(decade),castCount =first(castCount),genres =paste(unique(genres), collapse =", ") # Combine genres into one string )ACTORS <- NAME_BASICS |>filter(grepl("actor", primaryProfession, ignore.case =TRUE), is.na(deathYear)) |>separate_longer_delim(primaryProfession, delim =",") |>filter(primaryProfession =="actor") |>mutate(age =2024- birthYear) |>select(-`deathYear`)# Joining with TITLE_PRINCIPALSACTORS_TITLES <- ACTORS |>inner_join(TITLE_PRINCIPALS, by =c("nconst"="nconst")) |>left_join(TITLE_BASICS, by =c("tconst"="tconst")) |>inner_join(FILTERED_MOVIES, by =c("tconst"="tconst")) |>select(`nconst`,`primaryName`,`age`,`tconst`,`primaryTitle`,releaseYear =`startYear`,`decade`,`castCount`,`runtimeMinutes`,genres =`genres.x`,mainGenre =`genres.y`,`metric`,`avgRating` ) |>distinct()DIRECTORS <- NAME_BASICS |>filter(grepl("director", primaryProfession, ignore.case =TRUE), is.na(deathYear)) |>separate_longer_delim(primaryProfession, delim =",") |>filter(primaryProfession =="director") |>mutate(age =2024- birthYear) |>select(-`deathYear`)# Joining with TITLE_PRINCIPALSDIRECTORS_TITLES <- DIRECTORS |>inner_join(TITLE_PRINCIPALS, by =c("nconst"="nconst")) |>left_join(TITLE_BASICS, by =c("tconst"="tconst")) |>inner_join(FILTERED_MOVIES, by =c("tconst"="tconst")) |>select(`nconst`,`primaryName`,`age`,`tconst`,`primaryTitle`,releaseYear =`startYear`,`decade`,`castCount`,`runtimeMinutes`,genres =`genres.x`,mainGenre =`genres.y`,`metric`,`avgRating` ) |>distinct()# Let's get anyone who's been in a film since at least 2015LAST_TEN_YEARS_ACTORS <- ACTORS_TITLES |>filter(releaseYear >=2015)LAST_TEN_YEARS_DIRECTORS <- DIRECTORS_TITLES |>filter(releaseYear >=2015)# since Action and Drama are the best genres, let's subset to thoseBEST_GENRES_ACTORS <- LAST_TEN_YEARS_ACTORS |>filter(mainGenre %in%c("Action", "Drama"))BEST_GENRES_DIRECTORS <- LAST_TEN_YEARS_DIRECTORS |>filter(mainGenre %in%c("Action", "Drama"))# For actors, let's get people who can work with big supporting casts. # We don't know if that's what we want, but it'll be nice to havecast_threshold <- BEST_GENRES_ACTORS |>summarize(threshold =quantile(castCount, 0.95, na.rm =TRUE)) |>pull(threshold)ACTORS_CREWS <- BEST_GENRES_ACTORS |>group_by(nconst) |>summarize(median_castCount =median(castCount, na.rm =TRUE)) |>filter(median_castCount >= cast_threshold) |># Filter for median castCount in the top 5 percentilEungroup()ACTORS_CAST <- BEST_GENRES_ACTORS |>inner_join(ACTORS_CREWS, by =c("nconst"="nconst")) |>select(-`median_castCount` ) |>distinct()# Let's pull their best works onlyBEST_FILMS_ACTORS <- ACTORS_CAST |>group_by(nconst) |>filter(avgRating ==max(avgRating, na.rm =TRUE)) |>ungroup()# For actors, let's narrow it down to actors whos best film is from this decadeCURRENT_BEST_ACTORS <- BEST_FILMS_ACTORS |>filter(decade==2020)# We'll now whittle down directors. Who can get a runtime in our sweet-spot of 70 to 150 minutes?filtered_median_runtime <- BEST_GENRES_DIRECTORS |>group_by(nconst) |>summarize(median_runtime =median(runtimeMinutes, na.rm =TRUE)) |># Calculate median runtimefilter(median_runtime >=70& median_runtime <=150) # Filter for median runtime between 70 and 150# Doing the same but for directorsDIRECTORS_RUNTIME <- BEST_GENRES_DIRECTORS |>inner_join(filtered_median_runtime, by =c("nconst"="nconst"))# let's pull their best movies onlyBEST_FILMS_DIRECTORS <- DIRECTORS_RUNTIME |>group_by(nconst) |>filter(avgRating ==max(avgRating, na.rm =TRUE)) |>ungroup()# Let's remove any director who is also an actorNON_ACTOR_DIRECTORS <- BEST_FILMS_DIRECTORS |>anti_join(BEST_FILMS_ACTORS, by =c("nconst"="nconst"))# For directors, let's narrow it down to actors whos best film is from this decadeCURRENT_BEST_DIRECTORS <- NON_ACTOR_DIRECTORS |>filter(decade==2020)
Code
dim(NON_ACTOR_DIRECTORS)
[1] 15530 14
Code
dim(BEST_FILMS_ACTORS)
[1] 1371 13
Taking a quick breather! We just got out data down quite a bit. Let’s keep going.
Code
MOST_POPULAR_ACTORS <- CURRENT_BEST_ACTORS |>left_join(FILTERED_MOVIES, by =c("tconst"="tconst")) |>select(`nconst`,`primaryName`,`age`,`primaryTitle`,`year`,`mainGenre`,avgRating =`avgRating.x`,`numVotes` ) |>arrange(desc(numVotes))# Let's also bring in how many films they've worked onACTOR_FILM_CT <- TITLE_PRINCIPALS |>filter(category =="actor") |>group_by(nconst) |>summarize(film_count =n())MOST_POPULAR_ACTORS <- MOST_POPULAR_ACTORS |>left_join(ACTOR_FILM_CT, by =c("nconst"="nconst"))MOST_POPULAR_ACTORS <- MOST_POPULAR_ACTORS |>filter(!is.na(film_count))head(MOST_POPULAR_ACTORS)
# A tibble: 6 × 9
nconst primaryName age primaryTitle year mainGenre avgRating numVotes
<chr> <chr> <dbl> <chr> <dbl> <chr> <dbl> <dbl>
1 nm8188622 Jacob Batalon 28 Spider-Man:… 2021 Action 8.2 910046
2 nm1069587 Darshan Kuma… 38 The Kashmir… 2022 Drama 8.6 575254
3 nm12220302 Sourav Verma NA The Kashmir… 2022 Drama 8.6 575254
4 nm1282966 O-T Fagbenle 43 Black Widow 2021 Action 6.6 437506
5 nm0000439 Neil Patrick… 51 The Matrix … 2021 Action 5.6 286445
6 nm1186373 Adam Sztykiel 46 Black Adam 2022 Action 6.2 282231
# ℹ 1 more variable: film_count <int>
Code
# Step 1: Join MOST_POPULAR_ACTORS with TITLE_PRINCIPALS to get tconst (movies the actor was part of)actor_movies <- MOST_POPULAR_ACTORS |>inner_join(TITLE_PRINCIPALS, by ="nconst") |>select(nconst, tconst)# Step 2: Join actor_movies with MOVIES to get the average rating of each movieactor_movie_ratings <- actor_movies |>inner_join(MOVIES, by ="tconst") |>select(nconst, tconst, averageRating)# Step 3: Calculate the average rating for each actor across their moviesactor_avg_rating <- actor_movie_ratings |>group_by(nconst) |>summarize(avg_rating =mean(averageRating, na.rm =TRUE))# Step 4: Join the calculated average rating back to MOST_POPULAR_ACTORSMOST_POPULAR_ACTORS <- MOST_POPULAR_ACTORS |>left_join(actor_avg_rating, by ="nconst")MOST_POPULAR_ACTORS <- MOST_POPULAR_ACTORS |>select(`nconst`,`primaryName`,`age`,`primaryTitle`,`year`,`mainGenre`,avgMainMovieRating =`avgRating`,`numVotes`,filmCount =`film_count`,avgActorRating =`avg_rating` )MOST_POPULAR_DIRECTORS <- CURRENT_BEST_DIRECTORS |>left_join(FILTERED_MOVIES, by =c("tconst"="tconst")) |>select(`nconst`,`primaryName`,`age`,`primaryTitle`,`year`,`mainGenre`,avgRating =`avgRating.x`,`numVotes` ) |>arrange(desc(numVotes))# Let's also bring in how many films they've worked onDIRECTOR_FILM_CT <- TITLE_PRINCIPALS |>filter(category =="director") |>group_by(nconst) |>summarize(film_count =n())MOST_POPULAR_DIRECTORS <- MOST_POPULAR_DIRECTORS |>left_join(DIRECTOR_FILM_CT, by =c("nconst"="nconst"))MOST_POPULAR_DIRECTORS <- MOST_POPULAR_DIRECTORS |>filter(!is.na(film_count))# Step 1: Join MOST_POPULAR_DIRECTORS with TITLE_PRINCIPALS to get tconst (movies the director was part of)director_movies <- MOST_POPULAR_DIRECTORS |>inner_join(TITLE_PRINCIPALS, by ="nconst") |>select(nconst, tconst)# Step 2: Join director_movies with MOVIES to get the average rating of each moviedirector_movies_ratings <- director_movies |>inner_join(MOVIES, by ="tconst") |>select(nconst, tconst, averageRating)# Step 3: Calculate the average rating for each director across their moviesdirector_avg_rating <- director_movies_ratings |>group_by(nconst) |>summarize(avg_rating =mean(averageRating, na.rm =TRUE))# Step 4: Join the calculated average rating back to MOST_POPULAR_DIRECTORSMOST_POPULAR_DIRECTORS <- MOST_POPULAR_DIRECTORS |>left_join(director_avg_rating, by ="nconst")MOST_POPULAR_DIRECTORS <- MOST_POPULAR_DIRECTORS |>select(`nconst`,`primaryName`,`age`,`primaryTitle`,`year`,`mainGenre`,avgMainMovieRating =`avgRating`,`numVotes`,filmCount =`film_count`,avgDirectorRating =`avg_rating` )
Code
# Results!head(MOST_POPULAR_ACTORS)
# A tibble: 6 × 10
nconst primaryName age primaryTitle year mainGenre avgMainMovieRating
<chr> <chr> <dbl> <chr> <dbl> <chr> <dbl>
1 nm8188622 Jacob Batalon 28 Spider-Man:… 2021 Action 8.2
2 nm1069587 Darshan Kuma… 38 The Kashmir… 2022 Drama 8.6
3 nm12220302 Sourav Verma NA The Kashmir… 2022 Drama 8.6
4 nm1282966 O-T Fagbenle 43 Black Widow 2021 Action 6.6
5 nm0000439 Neil Patrick… 51 The Matrix … 2021 Action 5.6
6 nm1186373 Adam Sztykiel 46 Black Adam 2022 Action 6.2
# ℹ 3 more variables: numVotes <dbl>, filmCount <int>, avgActorRating <dbl>
Code
dim(MOST_POPULAR_ACTORS)
[1] 474 10
Code
head(MOST_POPULAR_DIRECTORS)
# A tibble: 6 × 10
nconst primaryName age primaryTitle year mainGenre avgMainMovieRating
<chr> <chr> <dbl> <chr> <dbl> <chr> <dbl>
1 nm1218281 Jon Watts 43 Spider-Man: N… 2021 Action 8.2
2 nm0001282 Tony Goldwyn 64 Oppenheimer 2023 Drama 8.3
3 nm1347153 Tyler Perry 55 Don't Look Up 2021 Drama 7.2
4 nm0411539 Puneet Issar 65 The Kashmir F… 2022 Drama 8.6
5 nm0426059 Rian Johnson 51 Glass Onion 2022 Drama 7.1
6 nm0781913 Noah Segan 41 Glass Onion 2022 Drama 7.1
# ℹ 3 more variables: numVotes <dbl>, filmCount <int>, avgDirectorRating <dbl>
Code
dim(MOST_POPULAR_DIRECTORS)
[1] 3254 10
Code
# Identify the directors with the highest values for numVotes, filmcount, and avgDirectorRatinghighest_votes <- MOST_POPULAR_DIRECTORS |>filter(numVotes ==max(numVotes, na.rm =TRUE))highest_film_count <- MOST_POPULAR_DIRECTORS |>filter(filmCount ==max(filmCount, na.rm =TRUE))highest_avg_rating <- MOST_POPULAR_DIRECTORS |>filter(avgDirectorRating ==max(avgDirectorRating, na.rm =TRUE))# Combine the labels into one data framelabels_df <-bind_rows(highest_votes, highest_film_count, highest_avg_rating)# Create the scatter plotggplot(MOST_POPULAR_DIRECTORS, aes(x = filmCount, y = numVotes, color = avgDirectorRating)) +geom_point(size =3) +# Add points with size 3scale_color_gradient(low ="blue", high ="red") +# Color gradient from blue (low) to red (high)scale_y_continuous(labels = scales::comma, limits =c(0, 500000)) +# Adjust y-axis to max at 500,000labs(title ="Scatter Plot of Directors' Film Count vs Number of Votes",x ="Film Count",y ="Number of Votes",color ="Avg Director Rating"# Label for the color legend ) +geom_text(data = labels_df, aes(label =paste0(" ", primaryName)), vjust =-1, color ="black", size =4, font ="bold") +# Add labels for the highest pointstheme_minimal(base_size =15) +# Increase font size for better readabilitytheme(plot.background =element_rect(fill ="darkgray"), panel.background =element_rect(fill ="darkgray"), axis.line =element_line(color ="white", size =1.2), axis.text =element_text(color ="white"), # Set axis text color to whiteaxis.title =element_text(color ="white", face ="bold"), # Bolden axis titlesplot.title =element_text(hjust =0.5, color ="white", face ="bold"), # Center the title and bold itaxis.text.x =element_text(angle =45, hjust =1) # Rotate x-axis labels if needed )
Code
MOST_POPULAR_DIRECTORS
# A tibble: 3,254 × 10
nconst primaryName age primaryTitle year mainGenre avgMainMovieRating
<chr> <chr> <dbl> <chr> <dbl> <chr> <dbl>
1 nm1218281 Jon Watts 43 Spider-Man:… 2021 Action 8.2
2 nm0001282 Tony Goldwyn 64 Oppenheimer 2023 Drama 8.3
3 nm1347153 Tyler Perry 55 Don't Look … 2021 Drama 7.2
4 nm0411539 Puneet Issar 65 The Kashmir… 2022 Drama 8.6
5 nm0426059 Rian Johnson 51 Glass Onion 2022 Drama 7.1
6 nm0781913 Noah Segan 41 Glass Onion 2022 Drama 7.1
7 nm3725055 Jessica Henw… NA Glass Onion 2022 Drama 7.1
8 nm0072600 Ned Benson 47 Black Widow 2021 Action 6.6
9 nm0075244 Gabriel Beri… 75 Black Widow 2021 Action 6.6
10 nm0795153 Cate Shortla… 56 Black Widow 2021 Action 6.6
# ℹ 3,244 more rows
# ℹ 3 more variables: numVotes <dbl>, filmCount <int>, avgDirectorRating <dbl>